Statistical report of test project GeneLab Ecotypes in space:

pairwise analysis of conditions

with SARTools' edgeR library


Author: Richard Barker

Date: 2018-08-18


Table of contents

  1. Introduction
  2. Description of raw data
  3. Filtering low counts
  4. Variability within the experiment: data exploration
  5. Normalization
  6. Differential gene expression analysis
  7. R session information and parameters
  8. Bibliography

1 Introduction

The analyses reported in this document are part of the GeneLab Ecotypes in space project. The aim is to find features that are differentially expressed between Col, Cvi, Ler and WS2. The statistical analysis process includes data normalization, graphical exploration of raw and normalized data, test for differential expression for each feature between the conditions, raw p-value adjustment and export of lists of features having a significant differential expression between the conditions.

The analysis is performed using the R software [R Core Team, 2014], Bioconductor [Gentleman, 2004] packages including edgeR [Robinson, 2010] and the SARTools package developed at PF2 - Institut Pasteur. Normalization and differential analysis are carried out according to the edgeR model and package. This report comes with additional tab-delimited text files that contain lists of differentially expressed features.

For more details about the edgeR methodology, please refer to its related publications [Robinson, 2007, 2008, 2010 and McCarthy, 2012].


2 Description of raw data

The count data files and associated biological conditions are listed in the following table.

Table 1: Data files and associated biological conditions.
SampleLabel File Treatment Genotype
Col_Ground_1 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Col
Col_Ground_2 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Col
Col_Ground_3 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Col
Col_Ground_4 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Col
Cvi_Ground_1 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Cvi
Cvi_Ground_2 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Cvi
Cvi_Ground_3 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Cvi
Ler_Ground_1 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Ler
Ler_Ground_2 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Ler
Ler_Ground_3 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground Ler
WS2_Ground_1 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground WS2
WS2_Ground_2 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground WS2
WS2_Ground_3 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground WS2
WS2_Ground_4 DRB_RNAseq_counts_2018_Ecotypes_BRIC19_v5.txt Ground WS2

After loading the data we first have a look at the raw data table itself. The data table contains one row per annotated feature and one column per sequenced sample. Row names of this table are feature IDs (unique identifiers). The table contains raw count values representing the number of reads that map onto the features. For this project, there are 41671 features in the count data table.

Table 2: Partial view of the count data table.
Col_Ground_1 Col_Ground_2 Col_Ground_3 Col_Ground_4 Cvi_Ground_1 Cvi_Ground_2 Cvi_Ground_3 Ler_Ground_1 Ler_Ground_2 Ler_Ground_3 WS2_Ground_1 WS2_Ground_2 WS2_Ground_3 WS2_Ground_4
AT1G01010.1 303 285 131 332 551 271 154 342 96 25 271 93 44 32
AT1G01020.1 84 40 97 70 113 67 32 87 27 6 26 9 6 5
AT1G01020.2 83 32 93 60 108 65 33 81 21 6 22 9 6 4
AT1G01030.1 79 40 30 36 11 12 7 17 26 2 49 14 12 14
AT1G01040.1 477 324 345 517 850 411 269 414 242 20 500 149 120 64
AT1G01040.2 447 303 329 475 801 374 254 367 226 18 470 128 117 59

Looking at the summary of the count table provides a basic description of these raw counts (min and max values, median, etc).

Table 3: Summary of the raw counts.
Col_Ground_1 Col_Ground_2 Col_Ground_3 Col_Ground_4 Cvi_Ground_1 Cvi_Ground_2 Cvi_Ground_3 Ler_Ground_1 Ler_Ground_2 Ler_Ground_3 WS2_Ground_1 WS2_Ground_2 WS2_Ground_3 WS2_Ground_4
Min. 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1st Qu. 1 1 0 0 1 0 0 1 0 0 0 0 0 0
Median 34 28 20 38 48 26 15 34 14 2 27 9 5 4
Mean 379 319 262 449 495 262 177 372 189 48 362 152 66 50
3rd Qu. 195 157 112 213 266 148 88 185 81 14 151 50 32 22
Max. 2003292 1946182 1419028 2259068 1565851 905457 412651 1677741 970559 599191 1742480 1936451 312860 317225

Figure 1 shows the total number of mapped reads for each sample. Reads that map on multiple locations on the transcriptome are counted more than once, as far as they are mapped on less than 50 different loci. We expect total read counts to be similar within conditions, they may be different across conditions. Total counts sometimes vary widely between replicates. This may happen for several reasons, including:

Barplot total counts
Figure 1: Number of mapped reads per sample. Colors refer to the biological condition of the sample.

Figure 2 shows the proportion of features with no read count in each sample. We expect this proportion to be similar within conditions. Features with null read counts in the 14 samples will not be taken into account for the analysis with edgeR. Here, 3757 features (9.02%) are in this situation (dashed line).

Barplot null counts
Figure 2: Proportion of features with null read counts in each sample.

Figure 3 shows the distribution of read counts for each sample. For sake of readability, \(\text{log}_2(\text{counts}+1)\) are used instead of raw counts. Again we expect replicates to have similar distributions. In addition, this figure shows if read counts are preferably low, medium or high. This depends on the organisms as well as the biological conditions under consideration.

Estimated densities of raw counts
Figure 3: Density distribution of read counts.

It may happen that one or a few features capture a high proportion of reads (up to 20% or more). This phenomenon should not influence the normalization process. The edgeR normalization has proved to be robust to this situation [Dillies, 2012]. Anyway, we expect these high count features to be the same across replicates. They are not necessarily the same across conditions. Figure 4 illustrate the possible presence of such high count features in the data set.

Most represented sequences
Figure 4: Percentage of reads associated with the sequence having the highest count (provided in each box on the graph) for each sample.

We may wish to assess the similarity between samples across conditions. A pairwise scatter plot is produced (figure 5) to show how replicates and samples from different biological conditions are similar or different (\(\text{log}_2(\text{counts}+1)\) are used instead of raw count values). Moreover, as the Pearson correlation has been shown not to be relevant to measure the similarity between replicates, the SERE statistic has been proposed as a similarity index between RNA-Seq samples [Schulze, 2012]. It measures whether the variability between samples is random Poisson variability or higher. Pairwise SERE values are printed in the lower triangle of the pairwise scatter plot. The value of the SERE statistic is:

Pairwise scatter plot
Figure 5: Pairwise comparison of samples.

3 Filtering low counts

edgeR suggests to filter features with null or low counts because they do not supply much information. For this project, 18432 features (44.23%) have been removed from the analysis because they did not satisfy the following condition: having at least 1 counts-per-million in at least 3 samples.


4 Variability within the experiment: data exploration

The main variability within the experiment is expected to come from biological differences between the samples. This can be checked in three ways. The first one is to perform a hierarchical clustering of the whole sample set. This is performed after a transformation of the count data as moderated log-counts-per-million. Figure 6 shows the dendrogram obtained from CPM data. An euclidean distance is computed between samples, and the dendrogram is built upon the Ward criterion. We expect this dendrogram to group replicates and separate biological conditions.

Clustering
Figure 6: Sample clustering based on normalized data.

The second method of visaulizing the experiment variability is to look at the heatmaps of the two conditions as show on figure 7. On this figure the x-axis represents the two conditions (along with the replicates) and the y-axis represent the top 20 genes with the top variance over all samples.

Heatmap
Figure 7: Heatmap based on normalized data.

Another way of visualizing the experiment variability is to look at the first two dimensions of a multidimensional scaling plot, as shown on figure 8. On this figure, the first dimension is expected to separate samples from the different biological conditions, meaning that the biological variability is the main source of variance in the data.

Multidimensional scaling plot
Figure 8: Multidimensional scaling plot of the samples.

5 Normalization

Normalization aims at correcting systematic technical biases in the data, in order to make read counts comparable across samples. The normalization proposed by edgeR is called Trimmed Mean of M-values (TMM) but it is also possible to use the RLE (DESeq) or upperquartile normalizations. It relies on the hypothesis that most features are not differentially expressed.

edgeR computes a factor for each sample. These normalization factors apply to the total number of counts and cannot be used to normalize read counts in a direct manner. Indeed, normalization factors are used to normalize total counts. These in turn are used to normalize read counts according to a total count normalization: if \(N_j\) is the total number of reads of the sample \(j\) and \(f_j\) its normalization factor, \(N'_j=f_j \times N_j\) is the normalized total number of reads. Then, let \(s_j=N'_j/\bar{N'}\) with \(\bar{N'}\) the mean of the \(N'_j\) s. Finally, the normalized counts of the sample \(j\) are defined as \(x'_{ij}=x_{ij}/s_j\) where \(i\) is the gene index.

Table 5: Normalization factors.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
TMM normalization factors 1.13 1.08 0.94 1.07 1.20 1.26 1.10 1.09 0.94 0.65 0.94 0.75 1.04 1.01

Boxplots are often used to assess the quality of the normalization process, as they show how distributions are globally affected during this process. We expect normalization to stabilize distributions across samples. Figure 9 shows boxplots of raw (left) and normalized (right) data respectively.

Boxplots of raw and normalized counts
Figure 9: Boxplots of raw (left) and normalized (right) read counts.

6 Differential analysis

6.1 Modelization

edgeR aims at fitting one linear model per feature. For this project, the design used is ~ Genotype and the goal is to estimate the models' coefficients which can be interpreted as \(\log_2(\texttt{FC})\). These coefficients will then be tested to get p-values and adjusted p-values.

6.2 Dispersions estimation

The edgeR model assumes that the count data follow a negative binomial distribution which is a robust alternative to the Poisson law when data are over-dispersed (the variance is higher than the mean). The first step of the statistical procedure is to estimate the dispersion of the data.

Dispersions estimations
Figure 10: Dispersion estimates.

Figure 10 shows the result of the dispersion estimation step. The x- and y-axes represent the mean count value and the estimated dispersion respectively. Black dots represent empirical dispersion estimates for each feature (from the observed count values). The blue curve shows the relationship between the means of the counts and the dispersions modeled with splines. The red segment represents the common dispersion.

6.3 Statistical test for differential gene expression

Once the dispersion estimation and the model fitting have been done, edgeR can perform the statistical testing. Figure 11 shows the distributions of raw p-values computed by the statistical test for the comparison(s) done. This distribution is expected to be a mixture of a uniform distribution on \([0,1]\) and a peak around 0 corresponding to the differentially expressed features.

Histogram(s) of raw p-values
Figure 11: Distribution(s) of raw p-values.

6.4 Final results

A p-value adjustment is performed to take into account multiple testing and control the false positive rate to a chosen level \(\alpha\). For this analysis, a BH p-value adjustment was performed [Benjamini, 1995 and 2001] and the level of controlled false positive rate was set to 0.05.

Figure 12 represents the MA-plot of the data for the comparisons done, where differentially expressed features are highlighted in red. A MA-plot represents the log ratio of differential expression as a function of the mean intensity for each feature. Triangles correspond to features having a too low/high \(\log_2(\text{FC})\) to be displayed on the plot.

MA-plot(s)
Figure 12: MA-plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Figure 13 shows the volcano plots for the comparisons performed and differentially expressed features are still highlighted in red. A volcano plot represents the log of the adjusted P value as a function of the log ratio of differential expression.

Volcano plot(s)
Figure 13: Volcano plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Full results as well as lists of differentially expressed features are provided in the following text files which can be easily read in a spreadsheet. For each comparison:

These files contain the following columns:


7 R session information and parameters

The versions of the R software and Bioconductor packages used for this analysis are listed below. It is important to save them if one wants to re-perform the analysis in the same conditions.

Parameter values used for this analysis are:


8 Bibliography